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ABSTRACT 

I present a new framework for estimating a galaxy's gravitational potential, <t>, from its stellar 
kinematics. It adopts a fully non-parametric model for the galaxy's unknown phase-space 
distribution function, /, that takes full advantage of Jeans' theorem. Given an expression for 
the joint likelihood of <t> and /, the likelihood of <t> is calculated by using a Dirichlet process 
mixture to represent the prior on / and marginalising. 

I demonstrate that modelling machinery constructed using this framework is successful 
at recovering the potentials of some simple systems given perfect kinematical data, a situation 
handled effortlessly by traditional moment-based methods, such as the virial theorem, but in 
which the more modem extended-Schwarzschild method fails. Unlike moment-based meth- 
ods, however, the models constructed using this framework can easily be generalised to take 
account of realistic observational errors and selection functions. 
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1 INTRODUCTION 

Inferring the mass distribution of a galaxy from limited information 
on its stellar kinematics is a fundamental problem in modern astro- 
physics. Examples of this problem include measuring the masses 
of central black holes (e.g., van der Marel et al 1998; Siopis et al 
2009) and properties of dark matter haloes (e.g., Rix et al 1997; 
Saglia et al 2000; Thomas et al 2007) in nearby galaxies from mea- 
surements of the integrated line-of-sight velocity distributions of 
the galaxy's stars. Closer to home, surveys of the kinematical and 
chemical properties of vast numbers of stars within our own Galaxy 
are becoming available (see Ivezic, Beers & Juric 2012 for a re- 
view). For example, the Gaia mission (Perryman et al 2001) will 
measure positions and velocities of a sample of ~ 10 9 stars. A 
pressing challenge is to use such kinematical and chemical snap- 
shots to constrain the full dynamical structure of the Galaxy, in- 
cluding its distribution of dark and luminous matter. The discrete 
nature of the data available for our own Galaxy means that most 
methods developed for external galaxies do not work. 

The problem I address in this paper is the following: given 
some discrete stellar kinematical data D, how to calculate the like- 
lihoods pr(£>|<I>) of a range of different potentials <!>? I make the 
usual assumptions that the galaxy is collisionless and in a steady 
state. To keep the discussion focused, I further assume that the 
galaxy has a single, chemically homogeneous population of stars. 
Then the galaxy is completely described by just two unknown func- 
tions: the gravitational potential <J>(x) and the distribution function 
(hereafter DF) /(x,v), which is the probability density of stars in 
phase space. For most of the paper I focus on an idealised situation 
in which the data D represent an unbiased sample of the galaxy's 



stars for each of which we know (x, v) precisely. Then the problem 
becomes one of inferring <I> given a random realisation of /. 

The most obvious way of tackling the idealised problem is by 
applying moment-based methods, such as the virial theorem. Un- 
fortunately, moment-based methods are not easy to extend to the 
general case of imprecise measurements with complicated selection 
effects. My motivation for the present paper was to come up with 
a coherent alternative to the virial theorem and its variants that can 
naturally be extended to allow the computation of pr(£>|<I>) in more 
realistic scenarios. Recognising that we have - at best - only a ran- 
dom realisation of the DF /, I follow Magorrian (2006) in treating 
/ as a nuisance function that is to be marginalised out: pr(£)|<I > ) is 
obtained by integrating the well-defined joint likelihood pr(£)|<I>, /) 
over all possible /, the contribution from each / weighted by a 
prior that must satisfy certain consistency conditions. Crucial help 
is provided by Jeans' theorem (Binney & Tremaine 2008, hereafter 
BT08), which asserts that the DF can depend on (x, v) only through 
integrals of motion in the potential <S>. In particular, when expressed 
in terms of action-angle coordinates (J, 9), the DF must be uniform 
in 0: f = /(J). From the statistics and machine-learning commu- 
nities I borrow the idea of using a Dirichlet process mixture (e.g., 
Teh 2010) to model the prior distribution on the DF. In effect, the 
DF is modelled as a distribution of an arbitrary number of blobs of 
arbitrary size and shape in action space with a suitably chosen prior 
for the distribution of blob locations, shapes, sizes and weights. 

The paper is organised as follows. Section 2 uses a toy one- 
dimensional problem to underscore some of the shortcomings of 
existing methods for computing pr(Z?|<3?), particularly for the ide- 
alised case of perfect (or very good) data. Section 3 sets out the 
core ideas of my proposed solution. It introduces the idea of a 
Dirichlet process mixture and explains how, by treating the dis- 
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tribution of DFs as such a mixture, one can calculate pr(Z)|<5) by 
marginalising the joint likelihood pr(£>|S>, /) over /. The technical 
details of two different schemes for carrying out this marginalisa- 
tion are relegated to appendices. Section 4 demonstrates that this 
idea works by applying it to some simple test problems, starting 
with the toy one-dimensional problem of Section 2. The relation- 
ship between the Dirichlet process mixture method proposed here 
and some other potential-estimation methods is discussed in Sec- 
tion 5. Finally, Section 6 sums up, explains how the method can be 
extended in a natural way to take account of selection effects and 
measurement errors, and identifies future work. 



o.o 

X 



2 A TOY PROBLEM 

Consider a one dimensional galaxy in which stars move in a poten- 
tial 4>(jc) = jC0 2 x 2 and have some unknown distribution function 
f(J), where the action J = co(j 2 + v 2 /co 2 )/27t. Given a sample, D, 
consisting of the positions and velocities (x*,v*) of N stars drawn 
from this galaxy, what constraints can we place on CO? In particular, 
what is the posterior probability distribution pr(co|D)? I assume that 
there are no selection effects - the sample D is a fair representation 
of the underlying DF - and that we know the N stars' positions and 
velocities precisely. 



2.1 Virial theorem 

The virial theorem provides an effortless solution to this problem. 
The DF / satisfies the collisionless Boltzmann equation, 



dt dx 



d<t> df 

dx dv 



-0. 



(1) 



Assume that the galaxy is in a steady state, so that 3//oY = 0, and 
that / tapers off smoothly to zero for large \x\ and |v|. Multiply- 
ing (1) by xv, integrating over the (x, v) phase plane and rearranging 
gives 



9 f fv dxdv 
co = J 



/ fx 2 dxdv 



(2) 



which, as our N stars provide a fair sample of /, can be estimated 

as 



CO" ~ CO: 



VT : 



yN *2 

Y N x* 2 



(3) 



This approach is straightforward and to the point, but it suffers from 
the following drawbacks. 

(i) Going from (2) to (3) involves estimating integrals over the 
DF by taking appropriately weighted sums of the observed particle 
distribution. These estimates ignore the strong constraint on the DF 
provided by Jeans' theorem: when viewed as a function of action- 
angle coordinates (J,Q) instead of (x,v), the DF / = /(/) must be 
uniform in angle. Therefore, although C0vt — > co in the limit N — > °°, 
we should be able to do better for finite values of N. As an extreme 
example, given a sample of, say, N = 4 stars that all happen to lie 
exactly on an ellipse x* 2 + v* 2 /cOq = 1 for some cOq, it is more 
plausible to believe co = coo over whatever estimate covt provides. 

(ii) Apart from some special cases (e.g., An & Evans 2011 
and references therein), there is no general way of modifying the 
moment-based estimate (3) to take account of uncertainties in the 
measured coordinates (x*,v*). For example, in the Milky Way one 



Figure 1. A sample of N = 10 stars drawn from a one-dimensional toy 
galaxy with a simple harmonic oscillator potential = jd^x 2 in which 
(Do = 1 . The underlying DF of the model is a uniform distribution in ampli- 
tude a between a = 0.9 and a = 1, where a 2 = x 2 + v 2 /c0o- 



typically has only very crude estimates of the distances to indi- 
vidual stars, which in turn affects the estimate of their transverse 
velocities from their proper motions. 

(iii) Real stellar catalogues rarely provide a fair, unbiased sam- 
ple of the DF underlying the galaxy. Observations are inevitably 
subject to some selection function S(x,v), which gives the proba- 
bility that a star at (x, v) would be included in the sample. Although 
it is possible to extend the analysis above to use an "selective DF" 
/s(x,v) = 5(x, v)/(x, v), the results are dominated by any sharp 
features in 5(x,v). In other words, they are strongly affected by 
what is happening at the edges of the survey, which is worrying as 
one rarely knows 5(x,v) well. 

2.2 The modern approach: maximum-likelihood orbit-based 
models 

The most flexible modern scheme for estimating the potential is 
the so-called "orbit superposition" or "extended-Schwarzschild" 
method and its variants (see Chaname, Kleyna & van der 
Marel 2008 for an application to discrete kinematics). These work 
by considering a range of explicitly chosen trial potentials <3>. For 
each such <I> they: 

(i) Represent the DF as a weighted sum /(J) = Y,k w kfk(3) °f 
basis functions /*(J) that depend only on integrals of motion in 
the assumed <JP. This ensures that Jeans' theorem is satisfied. The 
simplest way of constructing such a basis is to take a representa- 
tive sample of single orbits in <I> (Schwarzschild 1979). The next 
simplest is to represent each by a bunch of neighbouring orbits. 

(ii) Calculate the contribution P„i = pr((x*, v*)|/jt,<I > ) that each 
basis element makes to each observed datapoint, including the ef- 
fects of any observational uncertainties. 

(iii) Find the set of weights w; that maximises the likeli- 
hood pr(£>|{w£},<Jp). For the present problem this likelihood is 
Yin Hk^nk w k an d is subject to the constraint that Y,k w k = !• 

(iv) Take this peak value of the likelihood as the likelihood 
pr(D|<I>) of the trial potential <I>. 

Let us see how well such a scheme works when applied to our 
toy one-dimensional problem. Figure 1 shows a sample of N — 10 
stars drawn from a galaxy with C0q = 1 and the annular top-hat DF 



f(J) = 



A, if 0.9<o(jc,v) < 1, 
0, otherwise, 



(4) 
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basis functions fa to some properties of the available sample D. 
There are many arbitrarily ugly, ad hoc ways of doing this, but I 
note that the solution proposed in the following section has a side 
effect of picking out a basis of just the right resolution in a natu- 
ral way (e.g., Escobar & West 1995). Fundamentally, however, the 
problem with the maximum-likelihood procedure above is that it 
looks only at the distribution of orbits that produces the best possi- 
ble match to the observed sample, with no regard for nearby orbit 
distributions that are only slightly less likely. 



2 8.3 1.0 1.5 2.0 2.5 3.0 

U> 

Figure 2. The likelihood pr(D|co) for the sample of 10 stars shown in Fig- 
ure 1 calculated using the standard maximum likelihood-based algorithm 
by adding nominal Gaussian error circles of standard deviation A around 
each observed (x,v). The results for different choices of A have been offset 
for clarity. As A — > the models cannot distinguish one potential from an- 
other. For comparison, the heavy vertical red line indicates the estimate (3) 
of CO obtained by using the virial theorem. 

where A > is an uninteresting normalisation constant and the am- 
plitude a(x,v) of an orbit passing through (x,v) is defined to be 



which for visualisation purposes is more convenient to use than the 
action J when labelling orbits on the phase plane. To illustrate the 
effects of observational uncertainties, I assign nominal Gaussian er- 
rors of standard deviation A to each (x*, v*) and consider the effects 
of shrinking A towards zero. 

I use the four-step modelling procedure above to assign a like- 
lihood pr(£)|co) to each of a range of trial values of CO. My basis 
for each is the set of abutting annuli fa{x,v) = constant for am- 
plitudes a| < x 2 + v 2 /cu 2 < zero otherwise. There are 200 
such annuli, running from a\ = to 0201 = 2 with uniform spacing 

— = 0.01. The contribution P,,^ that the £ th such annulus 
makes to the probability of observing the ?i th star is simply the in- 
tegral of fk(x,v) times a Gaussian of width A centred on (x*,v*). 
Having calculated these P^, I use the expectation-maximisation 
algorithm to find the set of weights that maximise the likelihood 
subject to the constraint that Y,k w k = 1 

The resulting plot of maximum likelihood versus assumed co 
for this dataset is shown on Figure 2. When A is large, this pro- 
cedure produces a likelihood distribution pr(£>|co) that peaks close 
to the correct value of a>o = 1 . Perversely, however, the likelihood 
flattens as A shrinks: if the data become too good, the model is un- 
able to distinguish one potential from another! It is easy to see why 
this is: given perfect knowledge of (x*, v*) there is a unique orbit in 
(almost) any given potential that passes through this (x*, v*) and no 
other; it is only when several stars lie along an orbit that one can say 
anything about the likelihood of the assumed potential. Therefore 
all potentials have the same likelihood. 

2.3 Comments 

What to do about this? One remedy is to consider only strongly 
parametrised forms for the DF or to take a non-parametric DF and 
impose some form of regularisation (e.g., Merritt 1993), but this 
has the disadvantage of introducing hard-to-understand biases in 
pr(Z)|<5). A related idea is to somehow couple the resolution of the 



3 A SOLUTION: MODELLING THE DISTRIBUTION OF 
DFS AS A DIRICHLET PROCESS MIXTURE 

Here is a more general restatement of the toy problem above. 
We have a galaxy with unknown potential <J>(x) and unknown DF 
/(x,v). We are given a list of the phase-space locations of 
N stars drawn from the galaxy. We may assume that the galaxy 
is in a steady state and that the list of stars is a fair sample 
of the underlying DF. Our job is to constrain the potential from 
these data D. In particular, we seek the posterior probability dis- 
tribution pr(<t>|D). By Bayes' theorem pr(4>|Z>) is proportional to 
pr(0|<E>) pr(<J>), where pr(<I>) is our prior on <I>. 

As the galaxy is in a steady state, it is natural to express the 
DF in terms of action-angle coordinates (3,9) instead of (x,v): by 
the strong Jeans theorem, the DF is a function /(J) of the actions 
only (BT08). Let d = 1, 2 or 3 be the number of dimensions in 
the system. If d = 3 then we may take J = (J r ,jQ,J§) in which the 
radial action J,- and the latitudinal action Jq must be non-negative. 
The azimuthal action J§ can take either sign. Similarly, for d = 2 
we have J = (J r ,J&) in which ^ 0, while for d = 1 the single 
action J ^ 0. The likelihood pr(D|4>) can be expressed in terms of 
the stars' actions Jj...Jjy as 

pr(D|*) = f d rf Jid^pr(x{,v*|j;,0|,<I>)... 

/ (6) 
J d d r N d d 9* N pr(x^|J^,*) • pr(J}.-.J^), 

where denotes some as-yet unstated assumptions (which are 
summarised in §3.4 below) and - for the present idealised problem 
of perfect data - each pr(x*, v*| J*, (9*,<I>) is simply a Dirac delta 
that picks out the J* corresponding to (x*,v*) for the assumed <I>. 
The only place that the potential enters into this problem is in the 
conversion from (x*,V*) to (J*, 0*); when expressed in terms of 
the actions, the likelihood pr(Jj...J^|j3) is independent of <E>. 

The DF does not appear explicitly in the innocuous-looking 
expression pr(J^...J^|J?) because it has been marginalised out: we 
have that 

pr(J|...J^) = J pr(J|...J^|/,^)pr(d/|^) (7) 

which involves summing the likelihood pr(Jj...Ji|/, M) over all 
DFs /(J) that satisfy the uniform-in-angle constraint imposed 
by Jeans' theorem. There remains the choice of prior measure 
pr(d/|J?), which is a distribution over distributions. A standard 
way of choosing this, well known from the statistics and machine- 
learning communities, is to model the DF as being drawn from a 
Dirichlet process mixture: essentially, /(J) is expressed as a sum of 
an arbitrary number of blobs in action space, the blobs having some 
distribution of locations, sizes, shapes and probability masses. The 
marginal likelihood (7) is obtained by marginalising the parameters 
that describe the blobs. This basic idea is explained more precisely 
below, with further discussion postponed until Section 5. 
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In the following let V be a large, but finite, volume of action 
space (Jb ox ) d that includes all of the J* and let H be a measure on 
this space. I take H to be proportional to the canonical (2n) d & d 3 
phase-space volume, normalised so that H(V) = 1. 



3.1 Dirichlet distribution 

Consider an arbitrary partition P of action space V into some num- 
ber of cells K, Let V* be the subvolume enclosed by the k th cell 
and let Tty be the associated probability mass: that is, is an in- 
tegral over the unknown DF within V^. As the DF is unknown, we 
may treat 7T = (jli , ...,%k) as a list of random variables. Clearly the 
7l£ must satisfy the conditions 71* ^ and £f =1 Tt^ = 1. Let us also 
make the simplifying assumption that the Jtt are independent of one 
another. 

What prior probability distribution can we take for the 
There are many ways of partitioning V, but the priors associated 
with different choices of partition (Vi,...,Vk) must be consistent. 
In particular, suppose we construct a new partition P' by splitting, 
say, the first two cells of the original partition P, so that P' has 
K+l cells with volumes {V v ,Vi» ,V 2 ,...,V K ) and V v UVy> =Vy. A 
natural condition to impose on the prior on the is that the total 
probability mass %y +Kyn associated with the first two cells Vy and 
Vi« in P' should satisfy (e.g., Magorrian 2006) 



pr(jti|P)= f d7Ti'pr(7li/,7ti» = Jti — 7li'|/ y ). 
JO 



(8) 



This must hold for any partition P of V into any number of cells 
K > 0, which, together with the assumption of independence of the 
7^, means that pr(7rj Ji) must be infinitely divisible (Feller 1971). A 
particularly simple prior that satisfies these conditions is the Dirich- 
let distribution, for which the probability density function is 



t D(ir\a) 



K 



K 



8| 1-J> C(a)n^' +at , 



(9) 



with free parameters a = (0Ci, ...,(Xk), with each > 0, and the 
normalising constant 



C(a) 



(10) 



where T(a) is the usual Gamma function. It is easy to confirm 
that (9) satisfies (8) as long as we take Cfy proportional to the vol- 
ume measure //(Vi) of the cell. 



3.2 Dirichlet process 

The consistency condition (8) means that we may restrict our at- 
tention in the following to priors defined on a large number K of 
identical, very small cells that differ only in their locations Jj. in 
action space. As the cells have identical volumes, they must also 
have identical values of a^. So, let us take = a/K and consider 
the limit K -> oo (Neal 2000, Rasmussen 2000). 

Any partition of V into a finite number of nonempty, non- 
overlapping subvolumes, (Vi , Vl), can be represented by group- 
ing together these tiny, equal- volume K cells; each of the K cells 
will lie inside precisely one of the V/. Let F(V/) be the probability 
mass associated with V), which for the present problem is just the 
value of the DF integrated over V) : 



(2nf [ 
Jv, 



so that F(V[) is just the sum £#31* restricted to cells k that lie in- 
side V). Using the consistency property (8) of the Dirichlet distribu- 
tion (9) together with the choice oc(V/) = aH(V/), it is easy to show 
that for any such partition (Vi , . . . , Vl) of V we have that 

(F(Vi),...,F(V L )) ~ 0(aff(Vi),...,aff(V L )), (12) 

where the ~ sign here means "is distributed as". This is the defin- 
ing property of a Dirichlet process (Ferguson 1973; see also Teh 
2010 for a brief, accessible introduction). A Dirichlet process has 
two parameters. One is the base measure H, which I take to be pro- 
portional to the canonical volume element (2K) d d d J. The other is 
the concentration parameter a, which controls the dumpiness of 
the distribution: the expectation value of F(Vi) is just H{V{) and 
the variance H(Vi)(l — H(Vj))/(a + 1). As a increases the vari- 
ance shrinks: all the probability mass tends to concentrate onto a 
single point somewhere in action space. 

To understand more about the properties of draws from a 
Dirichlet process and the effect of a, let us return to the picture of 
the limit of a large number K — > °° of equal- volume cells and sup- 
pose we draw N stars from the distribution (9). Let c; e {1, ...,K} 
be the cell number of the i* star. Clearly, pr(c, = c) = n c : the 
probability that the i th draw picks cell c is just % c . Marginalising 
7T = (7Ii , ...,71^) with the prior (9), the probability of drawing star 1 
from cell c\ , star N from cell qv is 

pr(ci...c N ) 

T(a) f ( £ \ -l+a/K -1+a/K 

= / d7T0 1 — > 7t; 7t e , • • • Jt Cw Jt, • • • Tlr 

r(a 



^ [Tr [n k (N) + - 



(13) 

where n^N) is the number of stars in cell k for this draw of N stars. 
Therefore the conditional probability 

pr(ci...c N ) n k {N) + { 



pr(cjv = c\c U -,c N -i) 



(14) 



fd d 3, 



(11) 



pr(c 1 ...c A ,_ 1 ) 2V-1 + 0C 
So, the first star is equally likely to come from any of the K cells. 
In the limit K — > °o the second star has probability 1 / ( 1 + a) of 
coming from the same cell as the first star. The remaining proba- 
bility cc/(l + a) is spread equally among the unoccupied cells. Star 
Af has probability n/(N — 1 + a) of coming from a cell that already 
holds n stars. The probability that it does not come from a cell oc- 
cupied by any of the previous TV — 1 stars is (x/(N — 1 + a). The 
same behaviour can be derived directly from the more abstract def- 
inition (12). When N is large the expectation value of the number 
of non-empty cells tends to alog(l +N/a) (e.g., Antoniak 1974; 
Teh 2010). 

This argument shows that the probability distribution of the 
stars is discrete: the DF will consist of a series of isolated spikes. 
Neighbouring parts of action space do not "know" about each other. 
Clearly then the marginal likelihood (7) is independent of how the 
J* are distributed, unless two or more of them happen to overlap 
precisely. Therefore pr(D|<I > ) is flat, just as we found for the models 
in Figure 2. In fact, any prior that satisfies the infinite-divisibility 
conditions will produce spiky, discrete DFs (Kingman 1993) and 
therefore will suffer from this problem. 

3.3 Dirichlet process mixture of blobs 

In order to give the prior on the DF some notion of continuity, 
let us smear out the probability mass % associated with the k th 
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cell around the cell's location J* with density proportional to some 
function Blob(J*|J^, A k ), where A k describes the size and shape 
of the blob. Then the DF becomes 



/(J) = £>Blob(J|J t ,A t ) 



(15) 



with Jt£ drawn from the Dirichlet distribution (9) with a k — a/K. 
The parameters itj, J k and A k of the blob associated with each 
cell are completely independent, save for the fact that Y,k n k = 1- 
It is perhaps helpful to think of the DF (15) as representing the 
galaxy by a sum of its progenitor stellar clusters in phase space, the 
tidal debris from each cluster smeared out by two-body encounters 
and other relaxation effects, but I emphasise that the blobs are fun- 
damentally purely formal devices used to introduce neighbouring 
parts of action space to one another. 

A convenient way of representing each blob would be by us- 
ing a single Gaussian, but recall that at least one of the components 
of J is constrained to be non-negative and yet we want % k to be 
the total probability mass of the blob. This means that the func- 
tion Blob(J| Jfc, At) must have unit mass when integrated over the 
physically allowed region of action space. To ensure this, I take 



Blob(J|J t ,il*) = £ *L{Wmh,K 

m— 1 

in which fA£ is the usual normal distribution, 



^(x|x,A-') ; 



\A\ 1 / 2 
{2n) d l 2 



exp 



(x-x) T A(x-x) 



(16) 



(17) 



where \A\ is the determinant of the precision (i.e., inverse covari- 
ance) matrix A, and R i , . . . , Rm are reflection operators that produce 
mirror images of the Gaussian at J = Jj.. For the case d = 3 in which 
J = (J r ,Je,Jty) there are M = 4 such operators: 



Ri =diag(+l,+l,+l), 
« 2 = diag(-l,+l,+l), 
« 3 =diag(+l,-l,+l), 
fl 4 = diag(-l,-l,+l). 



(18) 



These reflect the original Gaussian centred on J = J k about the 
J r — and the Jq — axes so that the total mass of the blob in 
the physically allowed J r ^ 0, Jq ^ subvolume is equal to one. 
Similarly, for the d = 2 case of J = (J r ,J&) the necessary reflections 
arei?! = diag(l, 1), R2 =diag(— 1,1) andford= 1 they are simply 
Ri = +1, if 2 = — 1- Note that, although the first argument J of the 
function Blob(J|Jfc,vl;t) has restrictions on the signs of some of its 
components, we do not need to impose any such restrictions on the 
second argument, J k . Therefore I allow the components of J* to 
take either sign, so that, in the three-dimensional case, any of the 
four choices Jj. = (±J r , ±7e,7^) refer to the same blob. 
In this scheme each cell has a characteristic width 



AJ = 2J box K- l / d 



(19) 



that shrinks as K — > 00. A priori, each star has probability 

pr(J* I discrete) = \/K, (20) 

of belonging to the blob associated with cell k. In this picture the 
blobs are "pinned" to the cell locations. In the continuum limit K — > 
<*> we can forget about these underlying cells and use jr.* and J k to 
refer directly to the probability mass and location of the k th blob, 



pr(Jj.|continuous) = 



the former having prior (12) and the latter 

l/(2/box) d , if all components \J k j\ < J box , 
0, otherwise, 

(21) 

so that pr(Jj- 1 continuous) (AJ) d = pr(J*| discrete). Both ways of 
thinking about J k are useful when deriving expressions for the 
marginal likelihood (Appendices B and C). 

For the prior pi(Ai i ) on the precision matrix A^ I adopt the 
uninformative (e.g., Press 2009) distribution 

pr(A k )=B \A k \-^ d + l \ (22) 

but with a restriction on the range of allowed A k to those that pro- 
duce blobs that are larger than the cell size AJ but smaller than ./box- 
Appendix A gives the details of how I impose this restriction and 
calculate the dimensionless normalisation constant So- 
Given a distribution of stars with actions J*,...,J^,, the likeli- 
hood is then 

pr(Jt...J^|7T,{J, A}) = n £ H*Blob(JS|J t ,il j t) 

n=U=l 
N K M 

n— 1 k—lm—l 

This awkward product of sums can be rewritten as the easier-to- 
handle sum of products, 

N K M z 

P r(ji...j^iz,{j,A})=£nn n tai*mJ*,V) " , 

Z n=lk=lm=l 

(24) 

at the cost of introducing the set of latent variables Z = {z nk m} that 
indicate which of the K x M Gaussians in equation (23) each of 
the N stars comes from: z n km = 1 if star n comes from reflection 
number m of the Gaussian that generates blob k and is zero oth- 
erwise; for each n there is precisely one combination of (k, m) for 
which Znkm = 1. It is then easy to show that marginalising Z from 
the expression (24) with the prior 

N K M 

P r(zi7r) = n n n *t ^ 

n— \ k— \ m— 1 

yields (23). With the DF written in this form, the marginal likeli- 
hood (7) becomes 

pr(Jt..J&|*) 

= E/d7r|dJ 1 .. x |dA 1 ...* 

pr(Jt...J^|Z,{J,yl})pr(Z|7r)pr(7 r )pr({J,A}). 
= E/ d7r / d Ji...tf/ dA i-K pr(JT...J^Z,7T,{J,A}). 

(26) 



3.4 Summary of probability distributions 

Here is a summary of the probability distributions introduced 
above, which serves as a summary of the assumptions A I make 
when calculating the marginal likelihood pr(J*...J^,|J3). The fun- 
damental assumption is that the DF is uniform in angle and can 
be described by a sum of blobs (15) in action space. Then from 
equation (26) the marginal likelihood pr(Jj...J^|J?) is obtained by 
marginalising the "probability of everything", 

pr(Jt...J^,Z,7r,{J,A})=pr(Jt...J^|Z,{J,yl}) 

xpr(Z|7T)pr(7r)pr({J,A}), 
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over the model parameters Z, 7r, (Ji , Ai),...,(Jk, Ar) in the limit 
K — > oo. The likelihood on the right-hand side of (27) is a product 
of Gaussians, 



N K M 



P r(jf..#,{j,A}) = fi n fi [wi«-j*»^t 1 



«= 1 A— 1 m=\ 



(28) 



where the reflection operators R m are given by (18). The priors are 



N K M 

P r(zi7r)=nnrhr> 

«— 1 k= 1 m— 1 

pr(7r) = ©(7r[a ), 
pr({J,A}) = npr(J*)pr(A t ) ; 



pr(Jjfc) 
pr(Ajt) 



rf , if all components j| < /(,<> 
otherwise, 



(2/box) 



(29) 
(30) 
(3D 

(32) 
(33) 



with a.Q — )•••)§)• The variables a, i^ox an d are (degener- 
ate) hyperparameters: a brief inspection of equations (27) to (33) 
shows that the model behaviour is controlled by the single hyper- 
parameter 



a = 



ago 

(2/box) 



(34) 



which has dimensions of (action-space volume) -1 . Following the 
discussion in Section 3.2, the larger the value of a' the larger the 
prior weight given to clumpy DFs that are composed of many blobs. 

The model defined by equations (27)-(33) is a straightfor- 
ward variant of the "infinite mixture of Gaussians" problem that 
is well known in the statistics and machine learning communities 
(e.g., Rasmussen 2000). The only differences are that I have in- 
troduced the reflection matrices R m and, forsaking some computa- 
tional convenience, have imposed explicitly noninformative priors 
on J* and A^. 



4 TESTS 

In this section I present the results of calculating the marginal like- 
lihood pr(£>|S>) given by equations (6) and (26) above for some 
simple test problems, starting with the one-dimensional simple har- 
monic oscillator of Section 2. I make use of two different schemes 
to compute the pr(J*...Jjy|j?) given by equation (26): 

(i) an exact calculation obtained by reducing (26) to a sum over 
all possible partitions of the set of N stars (equ. B10); 

(ii) a variational lower bound obtained by fitting a simple func- 
tional form to the integrand (27). 

Although both schemes are simple to implement in practice, their 
derivations are quite involved and so I relegate the details to Ap- 
pendices B and C respectively. Moreover, practical application of 
the exact calculation (i) is feasible only for small numbers of stars, 
N < 10. In contrast, the approximate variational scheme (ii) is com- 
putationally inexpensive: Appendix C2 provides step-by-step in- 
structions on how to implement it. 




Figure 3. The marginalised likelihood pr(Z)|co) for the sample of 10 stars 
shown in Figure 1 calculated using the exact method of Appendix B (solid 
curves) and the approximate variational method of Appendix C (dashed 
curves). The upper pair of curves are for concentration parameter a' = 
1(T 3 , the lower for a' = NT 1 . The two pairs of curves have been offset 
vertically from one another for clarity. Compare to Figure 2 . 



4.1 One-dimensional simple harmonic oscillator 



In the simple harmonic potential <5(jc 
ated with a star that passes through the point (jc, v) is 

,,2 



5(0 2 x 2 the action associ- 



J(x,v\(0) 



1 

2m 



xdx - 



CO 
2jt 



(35) 



For each of the sample D of N = 10 stars shown in Figure 1 chang- 
ing the assumed CO changes the action J* = J(x*,v*\ai). Figure 3 
plots the marginal likelihood pr(D|co) = pr(Jjf, —,Ju) for a range 
of assumed values of CO using both the exact calculation of Ap- 
pendix B and the variational estimate of Appendix C. 

This plot demonstrates two important points. The first, more 
practical, point is that the variational estimate (Appendix C) of 
the marginal likelihood agrees well with the exact calculation (Ap- 
pendix B), especially for small values of the concentration parame- 
ter a' (equ. 34). This is not surprising, as the variational estimate es- 
sentially approximates the integrand (27) by the contribution made 
by a single well-chosen blob, which will tend to be a good approxi- 
mation for small a. This is important because even for only N = 10 
stars the exact calculation of pr(£)|co) takes a few seconds per CO on 
a standard PC, whereas the variational approximation is instanta- 
neous. 

The second point to note from Figure 3 is that, despite the 
relatively small number of stars, the marginal likelihood is sharply 
peaked about the correct value of CO = 1 . The reason for this is that 
the Dirichlet process mixture used to model the DF favours DFs 
that are strongly peaked in action space. (A quantitative explana- 
tion of this is given at the end of Appendix B.) To illustrate this I 
have constructed realisations of two different simple-harmonic os- 
cillator models. The models have CO = 1 and N = 10 stars drawn 
from a uniform distribution of stellar amplitudes a between some 
a m [ n and a max . For one model I take the same narrow distribution of 
amplitudes, (a m ; n ,a max ) = (0.9, 1.0), as used in Figures 1 to 3. For 
the other I use the broader (a m i n ,amax) = (0, 1). For each realisation 
of each model Figure 4 compares the virial theorem estimate (3) of 
co to the expectation value, 



c»dpm = f pr(£>|co)pr(co)dco, 



(36) 



obtained from the marginal likelihood pr(£)|co) (equ. 6) with an un- 
informative prior pr(co) <>= 1 /to. It is evident from Figure 4 that the 
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Figure 4. Comparison of the virial theorem estimate Q)vt (equ. 3) to the marginal likelihood-based estimate (Bdpm (equ. 36) for many realisations of a simple 
harmonic oscillator model having N = 10 stars distributed uniformly in amplitude a between a m j„ and a max . The panel on the left plots the comparison for 
(a m i n ,a max ) = (0.9, 1). The one on the right is for (0, 1). 



estimate (36) is much better than oOvt m me case of the model 
with the narrow distribution of amplitudes: COdpm 1S always much 
closer the correct CO = 1 than Cflyr- The two estimates are compara- 
ble when the amplitude distribution is broad, however. 



4.2 Solar system 

Bovy, Murray & Hogg (2010) have recently applied a broadly sim- 
ilar technique to constrain the force law in the solar system using 
only a snapshot of the positions and velocities of the eight major 
planets at a specific instance in time, albeit by assuming a paramet- 
ric form for the DF from which the planets are drawn. I follow them 
by assuming that the potential in the solar system is of the form 



*(*) = - 



GM 



a-l)R n 



Y+l 



so that a planet at radius R feels a radial acceleration of 
GM fRo\ J 



R- 



Ri 



(37) 



(38) 



I take ^o = 1 AU, leaving y and M as free parameters. To illustrate 
the application of the method of Section 3 to a system with d = 2 
dimensions I ignore the motion out of the ecliptic plane. Then for 
an assumed (y,M), the two actions associated with an orbit that 
passes through the point (x,y) in the ecliptic plane with velocity 
(v x ,v y ) are 



If 1 f R + 

J R x,v[4>) = — fv R dR = - / 

271 J 71 JR_ 



^(x,v[4>) = — f pjfd$ = L, 



21 1/2 



2(£-*)- 



R2 



dR, 



(39) 



where L = xv y — yv x and E = iv 2 + <5 are the angular momentum 
and energy per unit mass respectively, and R±(E,L) are the corre- 
sponding apo- and peri-helion radii. 

For each assumed (y,M) I calculate the actions (J^ n ,J£ n ) as- 
sociated with the positions (x*,y*) and velocities (v* n , Vy n ) of the 
eight major planets from the 1 April 2009 ephemeris in Table 1 
of Bovy et al. (2010). Figure 5 shows the marginalised likelihood 
pi(D\y,M) = pr({(J^ n ,J^ n )}\A) calculated using both the exact 
method of Appendix B and the variational estimate of Appendix C. 



As in the case of the simple harmonic oscillator, the two meth- 
ods agree well when the concentration parameter a' is small. For 
larger a' the exact pr(D|<I>) becomes implausibly clumpy, which 
cannot be reproduced by the variational estimate. 

The resulting pi(D\y,M) is broadly similar to that obtained 
by Bovy et al. (2010, their Figure 6), who assumed an ad-hoc 
parametrised form for the DF and used a Markov-Chain Monte 
Carlo method to explore the posterior distribution of their DF 
and potential parameters simultaneously. There are some differ- 
ences: I see no evidence of the multimodal structure they found, 
and my posterior probability distribution is slightly tighter, with 
a stronger covariance between y and M. In common with them, 
I find that the model is only marginally consistent with the cor- 
rect result of (y,M) = (2,M ). The marginal likelihood peaks at 
(y,M) ~ (2.02, 1.07M S ): the model slightly overestimates the in- 
ward acceleration felt by the inner planets (including the earth), and 
slightly underestimates the accelerations further out. On the other 
hand, if we believe from Poisson's equation that V 2 <I> ^ every- 
where, then we must impose the prior constraint that y 2 and 
the resulting pr(0|y,M) is strongly peaked very close to the correct 
M = M ;0 value. 



4.3 A simple galaxy model 

A more realistic example is provided by a catalogue of stars taken 
from a snapshot of an equilibrium galaxy model. The toy galaxy 
has a black hole of mass M. embedded in a uniform, spherical dis- 
tribution of dark matter with mass Mo within a reference radius rg. 
The potential is then 



GM. GMq 2 
+ T r ■ 



(40) 



Throughout the following I set the reference radius ro = 1 . The stars 
in the galaxy are luminous test particles with a Hernquist (1990) 
number-density profile, 

1 



PW 



(41) 



and an isotropic internal velocity distribution. To generate the sim- 
ulated catalogue I use Eddington's formula to find the distribution 
function f(E) that produces the number-density profile (41) in the 
potential (40) and then generate mock observations by drawing 
(x* , \* ) for n=l,...,A'=10 4 stars from this DF. The toy galaxy has 
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Figure 5. Plots of logpr(D|y,M) for the solar system problem of Section 4.2. The panel on the left shows the results of the exact calculation with (Appendix B) 
for concentration parameter a' = 10~ 4 yr (AU)~ 2 . The middle panel shows the corresponding variational estimate (Appendix C). The panel on the right plots 
the exact result for a' = 10~ 2 yr (AU)~ 2 . 



1.05 




IJ.94 0.96 0.98 1.00 1.02 1.04 1.06 

M. 

Figure 6. Plot of logpr(D|M.,M i ) for the toy galaxy model of Section 4.3. 
Successive contour levels are spaced at Alogpr(D|M.,M^) = 1. 



M, = Mo = 1, so that tq is approximately equal to the radius of the 
sphere of influence of the black hole. I choose rn = >"o/( 1 + V2), 
which places half of the stars inside rQ. 

Having this catalogue of 10 4 stars I use the variational method 
of Appendix C to approximate the marginal likelihood pr(Z)|cE>) for 
potentials 4>(r) of the form (40) with different assumed (M.,M+). 
For each assumed (M..M+), the actions associated with an orbit 
that passes through the phase-space point (x, v) are 



5 DISCUSSION 

5.1 Comparison to Magorrian (2006) 

The modelling framework proposed in the present paper is a de- 
velopment of the ideas set out in Magorrian (2006, hereafter M06). 
As in M06, the likelihood pr(D|<I > ) is obtained by marginalising the 
DF / from the joint likelihood pr(Z)|<I > ,/) with an infinitely divisi- 
ble prior. The most significant development in the present paper is 
the introduction of blobs: in M06 I imposed condition (8) on the 
DF itself, but here I impose it on the distribution of blob centres 
from which the DF is generated. The blobs are essential: as com- 
mented on both in M06 and in Section 3.2 of the present paper, the 
imposition of (8) directly on the DF results in a flat marginalised 
likelihood pr(D|4>) when the data are too good. 

M06 evaded this issue by considering the problem of cal- 
culating pr(D|<E>) when the data D were realistically noisy, inte- 
grated line-of-sight velocity distributions. This leads to a compli- 
cated joint likelihood function pr(D|4>,/) which introduces strong 
correlations among different subvolumes of phase space. In con- 
trast, the present paper tackles the problem of estimating the poten- 
tial from a discrete, unbiased, error-free sample of the DF, the joint 
likelihood pr(Z)| <!>,/) of which introduces no coupling whatsoever 
between different regions of phase space (apart from those required 
by the strong Jeans theorem). 

The other difference between M06 and the present paper is the 
choice of prior. The infinite-divisibility criterion (equ 8 together 
with the assumption that the Fj are independent) means that the 
Laplace transform of the prior pr(F|oc), 



y r (x.v|*) = ±/v r d,= i£ + 



2(£-4>) 



1/2 



dr. 



7e(x,v|4>) =L-U, 



(42) 



where L = |x x v|, La 



-yv x and E 



<I> are the to- 



tal angular momentum, its projection onto the z axis and the energy 
per unit mass respectively, and r± (E,L) are the apo- and peri-centre 
radii. On an unremarkable standard PC the time taken to do the con- 
version from {(x*, v*)} to {J*} for the full sample of N = 10 4 stars 
and construct the variational estimate was about 1 second for each 
(M,,M+). The resulting marginalised likelihood (Figure 6) peaks 
very close to the correct (M,,M*) = (1,1) parameters from which 
the stars were drawn. 



pr(s|a)= f dFe~ sF pr(F|a) 
Jo 

must be of the form (Feller 1971) 



(43) 



pr(j|oc) = exp 



l-e 



-sF 



-5W(a,dF) 



(44) 



which is completely controlled by the choice of a and the Levy 
measure M(a,dF). In M06 I used the galaxy's luminosity pro- 
file as additional prior information on the DF and set M(a,dF) = 
aFe~ F dF, which is the least informative choice given such prior 
information on the expectation values of F (Skilling 1998). In 
the present paper I avoid using any such extra information and 



adopt the uninformative 9Vf(oc, dF) ■■ 
Dirichlet prior (9). 



= ae dF, which results in the 
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5.2 Sharpness of DF in action space versus uniformity in 
angle 

The models presented here prefer potentials in which the DF de- 
velops sharp features in action space: broadly speaking, the sharper 
the DF, the bigger the value of the marginalised likelihood pr(D|4>) 
(see Section 4 and Appendix B). This is qualitatively similar to 
Penarrubia, Koposov & Walker's (2012) scheme for constraining 
the Galactic potential from tidal streams by looking for poten- 
tials that minimise some estimate of the entropy of the stars in the 
stream. It is instructive to consider why such a "minimum-entropy" 
scheme works. Their scheme estimated the entropy based only on 
the energy of the stars' orbits, but in the following I generalise it to 
use actions. The true entropy 



J[/] = -|/(x,v)log/(x,v)d d xd d v 
= -Jf(J,8)\ogf(J,8)d d Jd d 8 



(45) 



is actually independent of the potential: it would be futile to use this 
S to constrain <!>. Instead, Penarrubia et al's scheme is equivalent to 
taking an orbit-averaged DF, 



/(J) = (2n)~ d J f(J,9)d d G, 



(46) 



which depends on the assumed <3>, and looking at how the orbit- 
averaged S[f] varies as <5 changes. When the correct S> is used, 
we have that f = f and so S[f] = S[f]. As the assumed <I> moves 
further from the correct one, the distribution of angles becomes less 
uniform, but the true entropy S[f] remains unchanged. Therefore 
S[f] must increase 1 and potentials that minimise this S[f] are most 
consistent with having a flat distribution in angle. 

An alternative way of constraining <I> would be to try to mea- 
sure directly how much the distribution of angles deviates from a 
uniform distribution: this is the basis of the "orbital roulette" idea 
proposed by Beloborodov & Levin (2004). It is unclear how to con- 
struct a suitable direct test of non-uniformity though, especially 
for angles 6 that belong to different actions J. In common with 
the minimum-entropy idea of the previous paragraph, the present 
scheme avoids this test by assuming from the outset that the an- 
gle distribution is uniform, / = /(J), and relying on the fact that 
this /(J) becomes "sharper" as neighbouring tori become more 
densely populated when the assumed <!> tends to the correct one. 
The two schemes are not equivalent: an example of a situation in 
which the uniform-in-0 orbital-roulette test differs from the present 
sharpness-in-J one is given in Figure 7. 



6 SUMMARY AND CONCLUSIONS 

My motivation for the work in this paper was to come up with a 
Bayesian alternative to the virial theorem: given a discrete realisa- 
tion of a galaxy's unknown DF, what is the unknown potential in 
which the stars are moving? The result, which is encapsulated in 
equations (6) and (26), follows from marginalising over all possi- 
ble equilibrium DFs, adopting a Dirichlet process mixture model 
for the prior probability distribution on the DF. 

The fundamental assumption of the method is that that galax- 
ies are in a steady state. We know that this is not strictly true, but 



1 An alternative way of showing this is by expressing / in equation (45) 
as the Fourier series /(J,0) = In/n(J)e in 6 in which /_„(J) = ft (J) and 
then Taylor expanding the logarithm in the integrand about /o(J) = /(J). 
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Figure 7. A contrived sample of stars drawn from a one-dimensional galaxy 
model with potential <J>(a) = ^atx 2 with to = 1. The pr(D|co) calculated 
using the Dirichlet process mixture scheme presented in this paper peaks 
very strongly at CO — 1, for which the action-space distribution becomes 
sharpest. On the other hand, the distribution of the stars in angle is flattest 
when co is far from 1 . 



we can reasonably expect most galaxies to be sufficiently close to 
equilibrium that the steady-state assumption is a good starting point 
on which to base more sophisticated time-dependent models (e.g., 
Binney 2005). The next assumption is that galaxy potentials are in- 
tegrate and we can map at will between (x,v) and action-angle 
coordinates (J, 0) given a potential 4>(x). Fortunately, the machin- 
ery for constructing such mappings is already at hand (McMillan & 
Binney 2008 and references therein; Binney 2012; Sanders 2012). 

I have implicitly assumed that constructing this (x, v) <-¥ (J, 0) 
mapping is expensive. In principle, one could attempt to con- 
strain both <5 and / simultaneously, using, e.g., standard Markov 
chain Monte Carlo methods to explore the posterior distribution 
pr(/, 4>|D) of both <5 and /. Then the constraints on <5 would come 
from using the Markov chain samples to marginalise / (see, e.g., 
Bovy et al 2010 who did this for a strongly parametrised / in a 
central potential). Doing this would require that the (x, v) — > (J, 0) 
mapping be constructed anew every time a new trial <I> is proposed. 
Therefore it seems more practical to carry out the marginalisation 
over / for each of a range of fixed trial <&, leading to equation (6) 
forpr(D|*). 

This paper is essentially a proof-of-concept demonstration that 
it is feasible to calculate this marginal likelihood for the idealised 
case of a perfect, error-free snapshot of of the positions and veloc- 
ities (x*,\*) of an unbiased sample of stars from the galaxy. Real 
catalogues have complicated selection biases and large, correlated 
errors on the measurements of individual stars. Neither of these are 
difficult to account for in principle: observational errors just mod- 
ify the pr(x*,vJ|JJ,<?^,<I>) factors in equation (6) from the delta 
functions used in the present paper; selection effects are properly 
accounted for by dividing the pr(Z?|"l>) of equation (6) by the nor- 
malising factor 

/ = J d^vtSj (xl vf) pi(x* u vfl Jt, 0?,*) ■ ■ ■ 

j d d x* N d d \* N s N {x* N y N )pr(x* N y N \r N ,e%,^)- P i(j\...r N \A), 

(47) 

where the selection functions S„(x*,\*) give the probability that 
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the n star would have made it into the catalogue had it had phase- 
space coordinates (x*,v*). 

For simplicity I have assumed that the stars in the galaxy are 
homogeneous. If the sample D splits cleanly into, say, two chemi- 
cally distinct subpopulations, D\ and £>2, then each population has 
its own independent DF and the marginal likelihood for the full 
sample is simply pr(Z)|<I > ) = pr(D[ |4>)pr(D2|^ > )- The more general 
case in which the types of individual stars are ambiguous can be 
dealt with by extending the latent variable z n km introduced in equa- 
tion (24) to indicate the parent DF of each star. 

The next step is to apply the ideas presented here to a more 
realistic problem. I have explained how to write down a coherent 
expression for the marginalised likelihood pr(D|4>) when the sam- 
ple D suffers from real observational errors and selection biases. 
The single biggest remaining challenge is to find a viable numeri- 
cal scheme for approximating the resulting expression: it is unlikely 
that the simple variational method of Appendix C would be directly 
applicable in such cases. Fortunately, there has been much work on 
more sophisticated approximation methods (e.g., Neal 2000, Kuri- 
hara et al 2007 and subsequent developments). I hope to report on 
an application of the ideas presented here to real data in the not- 
too-distant future. 
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APPENDIX A: THE TRUNCATED WISHART DISTRIBUTION 

This appendix explains how I truncate the improper, uninformative prior (22) introduced in section 3.3, 

pr(A k )=B Q \A k \- i 2( d+l \ (22) 

to eliminate blobs that are "small" compared to the cell size AJ and "large" compared to the action-space volume (2J\ )ox ) d . The truncated 
Wishart distribution introduced here (equ. A3) reappears in the calculation of the marginal likelihood (26) described in Appendices B and C 
below. 

A well-known distribution that is similar to (22) is the Wishart distribution (e.g., Press 2009), which has density 

^ (^|W,v)oc|yl|i( v - d - 1 )exp[-^tr(W- 1 A)] , (Al) 

controlled by the two parameters W and v. Comparison of (22) and (Al) suggests that one way of truncating the uninformative (22) is 
by taking pr(A^) = Wo(Wo,Vo) with Vo — > and Wo = WqI, where / is the identity matrix and Wo is related to the cell size AJ through 
Wq = (A/)~ 2 . This Wo (Wo, Vo) is directly proportional to the uninformative prior (22) for blobs that are large compared to AJ (i.e., for which 
tr(Wg 1 Ak) <t: 1). A snag is the Wishart distribution is normalisable only for v > d — 1 . To remedy this I take 

pr(40 = W(A k \W ,V ), (A2) 

where I define the truncated Wishart distribution W(A\W,v) to be that obtained from (Al) by excluding blobs with "volumes" |yl| -1 / 2 
larger than {2J mRX ) d , where 7 max °c J box : 

w(A|w.v) = / s(w ' v)|A| " (v ^ 1)exp ^^ tr(w_lyl) ]' ifi^r 1/2 ;$(2W. (A3) 

1 0, otherwise. 

The rest of this Appendix is concerned with deriving an explicit expression for the normalisation constant S(W,v) that appears in (A3). The 
derivation follows the same lines used to normalise the conventional, untruncated Wishart distribution Wj)(A| W, v) (e.g., Press 2009). 

We may assume that the matrix W _1 is symmetric, so let us begin by choosing a basis in which W _1 = diag(w 1 _1 , ...,w^ 1 ). B(W,v) is 
given by 

= / lAI^-^expI-IttfW-^ldA, (A4) 

where the integral is over all positive-definite symmetric matrices A that satisfy the constraint |yl| -1 / 2 < (2J m . dx ) d . We can express this as 
an explicit jd(d+ 1) -dimensional integral by carrying out a Cholesky decomposition on A, writing it as 

A = TT T , (A5) 
where T is a lower triangular matrix whose diagonal elements are strictly positive, 7},- > 0. Then the determinant, 

d 

|A| = |TT I '| = |T| 2 = nr£ 1 (A6) 
(=i 

is just the product of these diagonal elements. I impose the constraint that |-A| -1 / 2 < (2J m . dx ) d by considering only 7},- > T min , where 
Tmin = (2Jmax) _1 - From (A5) it is not difficult to show that 

tr(W- 1 A)=£wr 1 7; 2 (A7) 
'J 

and that 

dA^^'flTt'^U^j. (A8) 



Then (A4) becomes 



2 d 



(A9) 



S(W,v) 

= 2i JV |w| iv^'^nr ( hv-i+i), 



i=lj=l' 



d I 1 t2 
nun 

2w t 



in which the integral over each of the diagonal elements introduces a lower incomplete Gamma function, T (j(v — i+ 1), T^ m /2\Vi). If 
v > d — 1 and the scale set by the parameter W is small compared to (2J m . dx ) d , then 7^ in /2vv; — > and this reduces to the usual expression 
for the normalising constant of the untruncated Wishart distribution (Al). I set 7 max equal to a few times J\, ox . 
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APPENDIX B: AN EXACT CALCULATION OF THE MARGINAL LIKELIHOOD FOR SMALL N 



In this appendix I show how it is possible to rearrange the expression (26) for the marginal likelihood pr(Jj...J^|^[) so that all of the 
integrals can be evaluated by hand, leaving a (large and unwieldy) sum over partitions of stars among parent blobs. First I identify the first 
few moments of the actions Jf , of the stars that "belong" to each of the K blobs given Z: 



Nk ~ H H Znkm > ^ k ; 



j N M 

TT 22 H z nkmRmJ n , 
k n— 1 m= 1 



N M 



H Z„km(RmJ* n - Jt)(^mJ^- Jt) 7 
* rt— 1 m— 1 



(Bl) 



That is, A'j. is the number of stars that belong to blob k according to Z, Jj. is their mean action and S k the corresponding covariance matrix. 

In equation (26) the blob locations J k are allowed to float freely. I return to the derivation of the Dirichlet process used in §3.2 and set 
up a very fine grid in action space with K — > °° cells and use J k to refer to the location of the k* 1 cell: a blob is attached to each cell, although 
most cells will have zero mass, n k = 0. Then (26) becomes 



pr(Jt..J&|.*) = £ [ d7rpr(Z|7r)pr(7r)n / ^A k pr(Jf..J^|Z,{J, A}) w (A k ) 

Z J k=l 



(B2) 



Taking the probability distributions pr(Z|7r), pr(7r), pr(Jj...J^|Z, {J, A}) from equations (28-32) and pr(A k ) from (A2), then marginalising 
tv using standard properties of the Dirichlet distribution adopted for pr(7r), we have that 



pr(Ji..-W 



-N k ) 



r N M 

j dAtW(A t |w ,v )n n [K{r n \R m h,JK i )\ 



r(a+N) 



Z k=l 



r(f) 



(B3) 



n— 1 m— 1 



The a that appears in this expression is the concentration (hyper )parameter. Writing out explicit expressions for the Normal (17) and truncated 
Wishart distributions (A3) that appear in (B3), the integrand is 



N M 



W(4t|W 0> v ) fl ft [«lU-^')f 



n— 1 m— 1 



B{Wo ^\A K \^o- d -D exp 



(2ji) 



idN k 



-tr(w 1 A k ) - - £ £ z„ im (j; -^ ffl ji) r A Jt (j; -ju t ) 



' «— 1 m— 1 



(B4) 



where B(Wo,Vo) is the normalisation constant (A9) of the truncated Wishart prior for A k . Now use the identities x T Ax = tr(Axx r ) and 
(J*-R m J k ) T A k (J*- R m J k ) = (R m J* - 3 k ) T A k (R m J* - J k ) to complete the square in the argument of the exponential: 



H Z„ km (3*-R m Jk) A k (J*-R m J k )=Y< 52 z nkm^[A k (R m J^-J k )(R m J^-J k ) J 

«-lm-l fi-lffl-1 

N M 

= I £z nkm tr[A k ((R m r n -j k )+(h-h)m m Jn-h)+(h-h)) T \ 

n—l m—l 

= tr [A k N k S k +N k A k (J k - J k )(J k - J k ) 7 



(B5) 



Introducing W^. 1 = W 1 +N k S k , v k = Vo+N k and substituting these results back into (B3), the marginal likelihood becomes 



m(r riD- m yf] r ^ + ^ B{WQ ' Vo) fdA t \A 



kl M-d-l) exv 



tr^W-'+N.Ch-JkHJk-h) 7 ) 



(B6) 



Notice that each of the factors in the product over the k cells is 1 unless N k / 0. So, let us focus our attention on cells that have N k > and 
rewrite the sum over Z as a sum over partitions 2 of the set {Jj, .... J]^} into clusters of stars that belong to the same parent blob (} k ,A k ), then 
summing over the possible locations J k of the clusters within each partition. That is, having "pinned" the cells' locations J k to obtain (B2) I 
now unpin them and write 



K MM 

U\ El-Ill 

Z k=l P mi = l m N =l Ji J„„ 



(B7) 



where, given a partition P of {Jj, .... J]^}, tip is the number of elements (i.e., "clusters") in P, J k is the location of the k lh of the rip clusters 
and the variable m k runs over the M Gaussians that make up the blob associated with each cluster. In writing this sum it is understood that 
the J k are distinct. In the limit K —> °° each £j 4 becomes / d d J k /(AJ) d , where AJ = 2J^, ox K l / d is the cell size, and we have that 



pr(j|...J^) 



r(o) 



n P M 

r (a+A0 -in im E • 



m,v- 



r(g+^)fi(w l)! v ) 
' =i r(f) (27t) ^ 



I 



dA k \A k \2 



(v k -d-l) 



(AJ) 



^ J d^exp -^tr(A,(W,- 1 +^(J,-J t )(j,-J,) r ) 



(B8) 



2 Recall that a partition of a set A is a division into non-overlapping, non-empty subsets of A. 
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The integral over J k gives (2%) d / 2 \N k A k \ l / 2 exp[—jti(W k 1 A k )] provided N k > d \l simply set it to zero if N k < d. Then, restricting the 
sum to those partitions P' in which all clusters have N k > d stars, we obtain 



pr(J}...J^) = 



r(a) yfr f 



• E 



r(|+^) 



S(W ,v ) 



r(f) (ar) 
r(|+^) 



■ d(N t -l)fjd/2^ d 



I 



dAJylJ^-"'- 2 ) 



exp 



B(W ,V ) 



(B9) 



m N =l 



r(f ) (2*)^-i)/vf 2 (A/) d *(w t ,v t - 1) ■ 



where I have used equation (A4) to express the integral over A k as l/B(W k ,V k — 1). Finally, using K = (2J\, ox /AJ) d and T(a/A') — » AT/a as 
S^oowe find 



pr(Jf..Jfc|*) 



r(a) r 
r(a + jV) ^ 



oB(Wo,v ) 



(27, 



box J 



n 



r(#*) 



£=i (2ji) 



-1)^/2 



M 

E 

mi — 1 



i) 



(BIO) 



which is an exact expression for the marginalised likelihood as a sum over all possible partitions P' of the set {Jj, J^,} into clusters of 
N k ^ d stars. Identifying Bq = B(Wq,\q), notice that the quantity in square brackets is just the variable a' introduced in equation (34). 

Given N stars, the number of partitions P in the outer sum of (BIO) is given by the Bell number !%, where Hq = 1 and the ®/v satisfy 
the recurrence relation (see, e.g., Rota 1964) 



w+i 



N 

E 

n=0 



(BID 



For N = 10 stars there are 5?io = 1 15975 such partitions to consider, each of which involves an additional M N choices of m for each cluster! 
This combinatorial explosion renders this exact calculation impractical for realistic values of N. Nevertheless, it is feasible to carry out this 
calculation for N < 10 (see §§4.1 and 4.2), which provides a useful check of less expensive, approximate methods. 

The terms in the sum (BIO) depend on the stars' actions Jj, through the factor [B(W k ,V k - l)] -1 <* |W^|5(^ _1 ) ~ lA^S^I — 2 C^*— !), 
where S k is the covariance matrix of the stars that belong to the A* cluster (equ. Bl). Therefore the marginal likelihood (6) peaks for choices 
of potential that produce the sharpest distributions of stars in action space. 



APPENDIX C: ESTIMATING THE MARGINAL LIKELIHOOD 

This Appendix explains one way of estimating the value of the (log) marginal likelihood 



P(J*) = logpr(J*) = log 



E/// d7rdJdApr(J* ZttJA) 



(CI) 



by using a variational method to find a lower bound. In this and subsequent expressions I use J* without subscripts to stand for the full 
set of the stars' actions {J*,...,J^}. Similarly, J and A without subscripts stand for the blob parameters {Ji,...,J^} and {A\,...,Ak}, 
respectively. 

Introducing another probability distribution Q(liziA\i*), we can use Jensen's inequality to write 



P = log 



E///«z^)g™ 



>Lfff d ^e(z^ir)iogg™|) , L( r) 



(C2) 



Using the product rule pr(J*Z7rjA) = pr(Z7rJyl|J*)pr(J*), it is easy to see that the difference between the true marginal likelihood P and 
the lower bound L is given by the Kullback-Leibler (KL) divergence between Q and P, 

KL(eiip) = -E/// ^dA e( z.jAir)io g ||^, ( C3 ) 

which greater than zero unless Q = pr(Z7rjA| J*). The idea behind variational inference is to find a distribution Q that maximises the lower 
bound L (thereby minimising KL(Q||P)), while simultaneously leaving the integrals in the expression (C2) for L tractable. MacKay (2003) 
explains the origin of this idea from mean-field theories in statistical physics in which the partition function is estimated by minimising a 
variational free energy. My treatment below is an adaptation of that presented in Bishop (2007). 

Bearing the need to have a tractable expression for L, let us restrict our attention to distributions Q of the factorised form 

N K M 

Q(ZirjA\J*) = II IT IT Q^ knm \r)Qn k {n k \r)Q Jt , Ak {h,Ak\r). (C4) 

n—\k—\m—\ 

To keep notation reasonably compact I drop the subscripts and the explicit dependence on J* in these Q factors and use Q(% k ) as shorthand 
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for Qn k {iik\i*) an d so on - A straightforward application of variational calculus shows that the Q(ZirjA) of the factorised form (C4) that 
maximises L is given by 



n k m r r r 

logg(Z) = £ £ £ loge(z fa „„) = / dTT / dj / dAe(7r)e(jA)logpr(J*Z7rjA) + constant, (C5) 

«=lt=lm=l J J J 

loge(Tr) = £ loge(7tt) = £ /dJ ( dAe(Z)e(jA)logpr(J*Z7rjA) + constant, (C6) 

K r 

loge(JA) = £ logG(Jt,^t) = £ / dTT e(Z)e(7r)logpr(J*Z7rjA) + constant, (C7) 
k=\ z ■' 

in which the additive constants are chosen to ensure that each Q factor is correctly normalised and I have made use of the separability of 
the particular form of logpr(J*Z7rjA) in equation (27). Notice that the optimal choices of each of the three factors depends on the other 
two. This suggests an iterative scheme in which, starting from an initial guess for each factor, we cycle through equations (C5) to (C7) to 
update each in turn, repeating until convergence is reached. As L is convex with respect to each Q this scheme is guaranteed to converge. 
One can think of it as a generalisation of the expectation-maximisation algorithm (which in turn is a generalisation of the Richardson-Lucy 
algorithm) in which pointwise estimates of pr(J*Z7rJ/l) are replaced by estimates of its shape. 



CI Expressions for the optimal Q factors 

The integrals that appear on the right-hand sides of equations (C5) to (C7) are expectations of logpr(J*Z7rjA) with respect to different Q 
factors. A convenient shorthand for such expectations is 

E K [f] = J dTV Q(n)f, (C8) 

E Za [f] = £ f dnQ(n)Q(Z)f, (C9) 
z J 

and so on, in which the subscripts to E pick out with which Q distributions the expectation of / is to be taken. Using this notation equation (C5) 
for Q(Z) becomes 

log£(Z) = E wJA [logpr(J*Z7rjA)] +constant. (CIO) 
Substituting pr(J*Z7rjA) from (27) and absorbing all terms that do not depend on Z into the additive constant gives 

loge(Z) = E 7r [logpr(Z| 7 r)] +E JA [logpr(J*|ZjA)] +constant 



N K M 

L L L z "km log Pnkm + constant, 

n— 1 k—l m— 1 



(Cll) 



where 



P„im=E^[ln^] + ^E A [log|A i |]-idlog(2jr)-iE M [(J^-W m J t ) 7 'yl t (J*-^ m J it )] . (C12) 
Therefore the optimal Q(Z) is 



where the quantities 



N K M 
/— 1 k— 1 /;?=! 



r n km - K M , (C14) 

Z.yt'=l L* m '—i Vnk'm! 



known as the "responsibilities", are rescaled versions of p„k m chosen to ensure that Q(Z) is correctly normalised. For later use note that, from 
equation (CI 3), 

MZnkm] = r n km- (CI 5) 

As in equation (Bl) of Appendix B above I introduce the following expressions for the first few moments of the data J*, Jj^ that "belong" 
to each of the K blobs in the model: 

N M j N M j N M 

N k = £ £ r nkm , } k = j- £ £ r nkm R m J*, S k = j- £ £ r nkm {R m J k n -) k ){R m i* n -'i k ) T . (C16) 

Notice that the r nkm depend on the values of the three expectations that appear in (C12), which in turn depend on the choice of Q(tt) and 
Q(JA). I now turn to finding the optimal choices for these two distributions. 
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Applying the same procedure to equation (C6) for Q(tt), we have that 

loge(Tr) = E ZM [logpr(J*Z7rjyt)] 

= logpr(7r) + Ez[logpr(Z|7r)l + constant 

(C17) 

N K M ' 

= logpr(7r)+ Y, L L Ez[z„im]log% + constant. 

n— 1 k—lm—1 

Taking pr(7r) from (30), substituting Ez[z„km] = r nkm f rom (C15) and then identifying the quantity N k introduced in (C16) in the resulting 
sum over r nkm , the optimal Q(tv) is clearly a Dirichlet distribution (equ. 9), 

G(?r) = ©(w|a) (C18) 

in which oc^ = ao +N k . 

Similarly, the optimal choice of the remaining factor Q(JA) is 

loge(jA) = E Z7r [log pr(J*Z7Tjyl)] + constant 

K K 

= E z [logpr(J*|ZJyl)] + £ logpr(J t ) + £ logpr(yl t ) + constant 
k=l k=l 
K N M K 

= L L E Ez[zntm]log^(J^|i?mJ<:,A j(: )+£logW(yl t |Wo,Vo)+constant. 

k=ln=\m=\ ^—1 

Writing out explicit expressions for the normal and Wishart distributions that appear here, gathering together terms involving each 3k and 
using the identities x r Ax = tr(Axx r ) and (J* — R m 3 k ) T A k (3„ — R m ik) = (R m in — ik) T A k (R m 3% — 3k) to complete the square in the argument 
of the exponential (see also equ. B5 above) and simplifying gives Q(3, A) = Flf = i Q(Jk\A k )Q(A k ) with 



fpr(J t ), if N k ^d, 

\#(J*|J*,(#*4t) _1 ), otherwise. (C20) 



Q(h\A. 

Q(A k ) = W(A k \W k ,v k ), 



in which 

-1 _ w-l 



W * = W l +*k&k, 
v k =v () +N k , 



(C21) 



and N k , 3 k and St are the responsibility- weighted moments of the data defined in (CI 6). I note that the expression (C20) for Q(3 k \A k ) is, 
strictly speaking, the optimal choice only for the cases v k <S 1 or v k > d, but it suffices for the following. 



C2 Algorithm for finding the best Q 

Having Q(tt) and Q(3A) we are now in a position to calculate all of the expectations that appear in the expression (C12) for p nkm that 
determines Q(Z). Applying standard properties of the Normal, Wishart and Dirichlet distributions, the relevant results are 



E 



hA k \(3*-R,n3k) T M3n-Rm3 k )] = dN^ 1 + v k (R m 3* - J t ) r W k (R m 3„ - 3k) , (C22) 
log A* = E A [log l^|] = £ ¥ Q(v, + l-0)+rflog2 + log | W*| , (C23) 

logftifc = E w [logJtit] =\|/(ajfc)-vf E a *J ' (C24) 



where the digamma function V|/(z) = dlogT(z) /dz and I have assumed that N k > d. Substituting into (C12) gives, finally, 



- 1 1/2 
Pnkm = rcjAj,' exp 



"X7> ~ v k(Rm3^, - ik) T ^k{Rm3* n - 3 k ] 

2N k I 



(C25) 



An algorithm for finding the optimal Q(Ztv3A) is to alternately (i) update Q(Z) given Q(tv) and Q(3A), then (ii) update Q(tv) and 
Q{3A) given this new Q(Z). More explicitly, these two alternating steps are: 

(i) Having estimates of a k , N k , 3k, W* and v k , use equations (C14) and (C23) to (C25) to calculate the responsibilities r nkm . 

(ii) Plug these r nkm into equation (C16) to obtain updated values for the responsibility- weighted moments N k , 3k an d Sfc. Set a k = Oo+N k . 
Use equations (C21) to update W k and v k . 

I use the K-means algorithm (Bishop 2007) to initialise this procedure. The simplest way of checking for convergence is by examining the 
rate of increase of the lower bound L. 
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C3 Evaluation of the lower bound L 

From equation (C2) the lower bound 



'1 e(Z7rjA) j 

= E[logpr(J*Z7Tjyl)] - E[logg(Z7rjA)] (C26) 
= E ZM [logpr(J*|ZjA)] +E Z7r [logpr(Z|7r)] +£ 7r [logpr(7r)] +E JA [logpr(J)] +E M [logpr(A)] 
- E z [log g(Z)] - E w [log G(w)] - E M [log Q(JA)} . 

The expectations (C26) are easy to work out with the aid of the relations proved above. For example, taking pr(J*|ZJ/l) from (28) together 
with the expectations already worked out in (CI 2) and (C22) gives 

EzM[logpr(r|ZjA)] = \ £ N k {logA, -v k tr(S k W t ) -dlog(2jr)} . (C27) 

Similarly, taking pr(Z|7r) from (29) and pr(7r) from (30) together with (C24) for E^log^] gives 

K 

E Z7r [logpr(Z|7r)] = £ N k \og1i k , (C28) 
k=\ 

K 

E 7r [logpr(7f)] = logC(ao) + (Oo - 1) £ logjft. (C29) 

k=\ 

The other contributions to L are 

E JA [logpr(J)] = -Zdlog(27 box ), (C30) 
E JA [logpr(A)] = £ |logfl(Wo,V ) + V °~*~ 1 logAt - ^v t tr(W 'W t )j , (C31) 

W K M 

E Z [logg(Z)] =ZLL r nkmlogr nkm , (C32) 

/i— 1 £—1 m— 1 

X 

E 7r [loge( 7 r)] = logC(a) + £ (at - l)log«*, (C33) 

k=l 

E M [logg(jA)] = £ E JiAfc pog((2(J t |il*)fi(il t )], (C34) 

E J ^jioge(j t |A t )e(^)] = -//[e(^)] + (7^ 1 °f^ ) ' ifAW ' (C 3S) 

I ^(logAt + rflogiVjt) - ^(1 +log2:t), otherwise, 
H[Q(A k )\ = - logB(W t , v t ) _ V * ~ 1 log At + Ivfcrf. (C36) 



Most of these terms cancel, leaving 



/CfCK )\ X N K M 

L = l0§ ( ~c7^ j + I HI ^mlogr„t m , (C37) 



k=i n—\k—\tn=\ 

N k >d 



in which each blob with N k > d contributes a term 



L k = log ( jjwtVtj ) " 2" l0g A * + 2~ d [1 " l0g ^ " 21 °g( 27b »)] ■ (C38) 

Although it is not immediately obvious, this expression for L is very similar to one of the terms that appear in the sums over partitions P' in the 
exact expression (BIO) for the marginal likelihood. To show this, take C from equ (10) and use the approximation that T{a/K) — > (a/K)^ 1 
when K is large. Then the first term in L becomes 

C(oo) _ r(a) * r(f +A\) r(a) « a _ r(q) /ocn* + * 

^-noT^n^ — ^fTaT^v) n^ w -f(aTA0^ U rm (C39) 

» t ><i iV t ><i N 4 ><i 

as A: — > oo, where K + is the number of occupied blobs with N k > d. The Q factors in the variational Bayes algorithm tend to converge on 
a local maximum of the distribution. The maximum is degenerate, however, as can be seen by permuting the k indices of each blob: in 
general we will have K + blobs with distinct (J k ,A k ) plus K — K+ identical blobs with zero mass. As an approximate way of accounting for 
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these "missing" permutations in the integral (C2), I simply add a term \og(K\ / {K — K + )\) to L. This cancels out the stray K K + factor in 
equation (C39). With this correction, the approximate lower bound on the marginal likelihood becomes 



exp \L + K + log K] = exp 



= exp 



- £ r nkm log r nkm 

nkm 



- £ r nkm logr nkm 

nkm 

aV2, 



r(q) / qg(Wo,v ) 
T(a+N) \ K(2J box )d 

r(«) / ag(Wo,v ) 
T(a + yv) ^ (2J box )^ 



1 



^+ ^ r(iV t ) [^-dependent factors] 



(C40) 



N t >d "k 



B(W t ,v t -l) 



in which I have used (A9) and (C23) to write A k ' B(W k ,V k ) as B(W k ,v k — 1) times some -dependent factors. Apart from these factors 
and a related contribution from the entropic r„ km \ogr nkm prefactor, the result is identical to the contribution made to the exact result (BIO) 
by a single partition P 1 with a specific choice of reflections (ra i , . . . , m np , ) . 

This shows that the variational estimate is good provided: (a) the stars divide cleanly into distinct clusters in action space so that the exact 
marginal likelihood (BIO) is dominated by a single partition P'; and (b) the two-step algorithm given in C2 successfully finds this P' . The 
smaller the value of the concentration parameter a' = aB(Wo, Vq) / (2J mlix ) d , the more likely this condition is to be satisfied. For the purposes 
of the present paper, however, we do not strictly need the estimate to be "good" in this sense; it is more important that the estimate accurately 
captures changes in the marginal likelihood as changes in the trial potential modify the stars' actions {J|(x|, v|| < 5), JJ^(x^, v^,)}. Perhaps 
the most obvious example of a situation in which the estimate (C40) fails is one in which changing the potential changes the number K + of 
distinct clusters. 
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